KMID : 1022420180100030031
Phonetics and Speech Sciences 2018, Vol. 10, No. 3, pp. 31-39
Corpus-based evaluation of French text normalization
Kim Sun-Hee
Abstract
This paper presents a taxonomy of non-standard words (NSWs) for developing a French text normalization system and proposes a corpus-based method for evaluating such a system. The proposed taxonomy of French NSWs consists of 13 categories, including 2 types of letter-based categories and 9 types of number-based categories. To evaluate the text normalization system, a representative test set is constructed that includes NSWs from various text domains, such as news, literature, non-fiction, social-networking services (SNSs), and transcriptions, and an evaluation equation is proposed that reflects the distribution of NSW categories in the target domain to which the system is applied. The error rate on the test set is 1.64%, while the error rate for the whole corpus, which reflects the NSW distribution in the corpus, is 2.08%. The results show that the literature and SNS domains have higher error rates than the test set.
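The abstract does not reproduce the evaluation equation itself. As a rough illustration of the idea of weighting errors by a target domain's NSW category distribution, the sketch below assumes the overall error rate is a share-weighted sum of per-category error rates. The function name, the category labels, and all numbers are hypothetical and are not taken from the paper.

```python
def weighted_error_rate(category_error_rates, category_shares):
    """Combine per-category error rates using a domain's NSW category shares.

    Assumption (not from the paper): the domain-level error rate is the
    sum of each category's error rate multiplied by that category's
    normalized share of NSWs in the target domain.
    """
    total_share = sum(category_shares.values())
    if total_share <= 0:
        raise ValueError("category shares must sum to a positive value")
    return sum(
        category_error_rates[category] * share / total_share
        for category, share in category_shares.items()
    )


# Hypothetical per-category error rates measured on a test set.
error_rates = {"abbreviation": 0.04, "cardinal": 0.01, "date": 0.02}

# Hypothetical NSW category distributions for two target domains.
news_shares = {"abbreviation": 0.2, "cardinal": 0.5, "date": 0.3}
sns_shares = {"abbreviation": 0.6, "cardinal": 0.3, "date": 0.1}

news_rate = weighted_error_rate(error_rates, news_shares)
sns_rate = weighted_error_rate(error_rates, sns_shares)

print(f"news: {news_rate:.3f}, SNS: {sns_rate:.3f}")
```

Under this assumed weighting, the same system can receive different error rates in different domains purely because the category mix differs, which is the kind of effect the abstract reports for the literature and SNS domains.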
KEYWORD
text normalization, non-standard word (NSW), French, evaluation, corpus, text domain